HumanEval: by model

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their outcomes differ; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all pairs of models, the p-value depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
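
For concreteness, here is a minimal sketch of this sign test in Python (an assumed implementation for illustration, not necessarily the exact code behind this page): drop the ties, then run an exact two-sided binomial test on the remaining wins.

```python
# Sketch of the pairwise sign test described above (assumed implementation).
# Ties are excluded; under the null, A beats B with probability 1/2 on the
# problems where the two models disagree.
from scipy.stats import binomtest

def pairwise_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value for H0: P(A beats B | outcomes differ) = 1/2."""
    n = a_wins + b_wins              # ties already excluded
    if n == 0:
        return 1.0                   # the models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 problems, B wins 12, and the rest are ties;
# pairwise_p_value(30, 12) gives p ≈ 0.01, significant at the 5% level.
```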

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair lying to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at small values of |#A_win - #B_win|. For more explanation, see the doc.
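
The parabola follows from the normal approximation to the binomial: over the n = #A_win + #B_win informative problems, the null distribution of #A_win is Bin(n, 1/2), so the gap |#A_win - #B_win| has standard deviation sqrt(n) and must exceed roughly z * sqrt(n) to reach significance. A small illustrative sketch, with an assumed significance level:

```python
# Significance boundary ("the parabola") under the normal approximation:
# with n = #A_win + #B_win informative problems, the null gives
# #A_win - #B_win a standard deviation of sqrt(n), so the absolute gap
# must exceed about z * sqrt(n) to be significant.
import numpy as np
from scipy.stats import norm

def min_significant_gap(n_informative: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| significant at level alpha."""
    z = norm.ppf(1 - alpha / 2)      # e.g. 1.96 for alpha = 0.05
    return z * np.sqrt(n_informative)

# With 40 non-tied problems the gap must exceed ~1.96 * sqrt(40) ≈ 12.4.
# Since every model pair here has a fairly large #A_win + #B_win, pairs
# with a small absolute gap are never significant, hence the sharp transition.
```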

Results table by model

We show three methods currently used to evaluate code models: raw accuracy (pass@1, as reported by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the ratings are centered so that the average is 1000.
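
The two aggregate scores can be sketched in a few lines. The following is an assumed implementation for illustration (the wins matrix, function names, and the ties convention are mine, and the actual pipeline may differ): natural-log Bradley-Terry coefficients are mapped to the Elo scale at 400 / ln 10 points per unit, with the anchor model (e.g. gpt-3.5-turbo) pinned at 1000.

```python
# Sketch of the two aggregate scores (assumed implementation, not the
# site's exact pipeline). wins[i, j] = number of problems model i solves
# and model j fails; the diagonal is zero and ties are excluded.
import numpy as np
from scipy.optimize import minimize

def avg_win_rate(wins: np.ndarray) -> np.ndarray:
    """Mean win rate of each model over all opponents (ties excluded here;
    counting ties as half a win is another common convention)."""
    games = wins + wins.T
    with np.errstate(divide="ignore", invalid="ignore"):
        rate = np.where(games > 0, wins / games, 0.5)
    np.fill_diagonal(rate, np.nan)            # a model does not play itself
    return np.nanmean(rate, axis=1)

def fit_bradley_terry(wins: np.ndarray) -> np.ndarray:
    """Maximum-likelihood Bradley-Terry strengths beta, where
    P(i beats j) = 1 / (1 + exp(-(beta_i - beta_j)))."""
    m = wins.shape[0]

    def neg_log_lik(beta):
        diff = beta[:, None] - beta[None, :]
        # -log P(i beats j) = log(1 + exp(-(beta_i - beta_j)))
        return (wins * np.logaddexp(0.0, -diff)).sum()

    return minimize(neg_log_lik, np.zeros(m), method="BFGS").x

def to_elo(beta: np.ndarray, anchor: int, anchor_elo: float = 1000.0):
    """Map strengths to the Elo scale (base 10, 400-point spread) and pin
    the anchor model at anchor_elo; Bradley-Terry coefficients are only
    identified up to a shift, so some anchor is needed."""
    elo = beta * 400.0 / np.log(10.0)
    return elo - elo[anchor] + anchor_elo
```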

model  pass@1  win_rate  Elo
claude-3-opus-20240229 82.9% 88.0% 1336.8
deepseek-coder-33b-instruct 81.7% 85.5% 1299.4
speechless-codellama-34b 77.4% 79.8% 1231.3
meta-llama-3-70b-instruct 77.4% 79.6% 1225.7
opencodeinterpreter-ds-33b 77.4% 78.8% 1219.2
claude-3-haiku-20240307 76.8% 76.6% 1198.2
gpt-3.5-turbo 76.8% 77.5% 1206.3
mixtral-8x22b-instruct-v0.1 76.2% 75.7% 1186.6
xwincoder-34b 75.6% 76.1% 1194.2
deepseek-coder-7b-instruct-v1.5 75.6% 76.2% 1189.9
code-millenials-34b 74.4% 72.3% 1154.3
opencodeinterpreter-ds-6.7b 74.4% 74.8% 1175.7
deepseek-coder-6.7b-instruct 74.4% 71.6% 1144.2
HuggingFaceH4--starchat2-15b-v0.1 73.8% 72.7% 1156.4
openchat 72.6% 70.6% 1141.2
white-rabbit-neo-33b-v1 72.0% 69.1% 1129.9
code-llama-70b-instruct 72.0% 68.8% 1129.4
speechless-coder-ds-6.7b 71.3% 69.8% 1130.4
codebooga-34b 71.3% 69.1% 1120.9
claude-3-sonnet-20240229 70.7% 66.1% 1100.4
mistral-large-latest 69.5% 64.2% 1085.8
Qwen--Qwen1.5-72B-Chat 68.3% 62.7% 1074.6
bigcode--starcoder2-15b-instruct-v0.1 67.7% 61.6% 1069.3
speechless-starcoder2-15b 67.1% 60.3% 1052.1
deepseek-coder-1.3b-instruct 65.9% 58.4% 1041.9
microsoft--Phi-3-mini-4k-instruct 64.6% 55.6% 1033.0
codegemma-7b-it 60.4% 48.0% 975.2
wizardcoder-15b 56.7% 41.5% 931.6
speechless-starcoder2-7b 56.1% 40.7% 921.3
code-13b 56.1% 41.0% 923.5
code-33b 54.9% 39.2% 906.3
speechless-coding-7b-16k-tora 54.9% 38.2% 897.5
open-hermes-2.5-code-290k-13b 54.3% 37.6% 898.8
deepseek-coder-33b 51.2% 34.0% 865.6
wizardcoder-7b 50.6% 32.2% 858.3
phi-2 49.4% 31.6% 851.6
code-llama-multi-34b 48.2% 28.6% 828.7
mistral-7b-codealpaca 48.2% 29.8% 834.4
speechless-mistral-7b 48.2% 27.2% 813.7
starcoder2-15b-oci 47.0% 26.7% 806.5
mixtral-8x7b-instruct 45.1% 26.3% 796.5
codegemma-7b 44.5% 27.5% 798.9
solar-10.7b-instruct 43.3% 23.0% 772.1
gemma-1.1-7b-it 42.7% 21.4% 758.3
mistralai--Mistral-7B-Instruct-v0.2 42.1% 21.9% 767.1
xdan-l1-chat 40.2% 19.2% 738.1
code-llama-multi-13b 37.8% 16.6% 698.0
octocoder 37.2% 15.7% 694.3
python-code-13b 32.9% 12.2% 636.7